Provenance, or information about the sources, derivation, custody or history of data, has been studied
recently in a number of contexts, including databases, scientific workflows and the Semantic
Web. Many provenance mechanisms have been developed, motivated by informal notions such as 
influence, dependence, explanation and causality. However, there has been little study of whether 
these mechanisms formally satisfy appropriate policies or even how to formalize relevant motivating 
concepts such as causality. We contend that mathematical models of these concepts are needed to 
justify and compare provenance techniques. In this paper we review a theory of causality based on 
structural models that has been developed in artificial intelligence, and describe work in progress on 
using causality to give a semantics to provenance graphs. 

1 Introduction 

Provenance is a general term referring to the origin, history, chain of custody, derivation or process that 
yielded an object. In analog settings such as art and archaeology, such information is essential for
understanding whether an artifact is authentic, valuable or meaningful. In the digital world, provenance
is now recognized as an important problem because it is very easy to silently alter or forge digital
information. We already pay a price because of the lack of robust mechanisms for recording and managing
provenance: serious economic losses have been incurred due to the lack of provenance on the Web [3],
and lack of transparency of scientific processes and results is routinely used to sow confusion and doubt
about climate change [25].

Much work on provenance considers the following basic scenario: we have some input data and 
some complex process that will be run on the input, for example a large program, possibly split into 
many smaller jobs and executed in parallel or on a distributed system. In this setting, a proposed solution 
generally records some additional information as the program runs. The additional information is often 
called provenance and it is supposed to provide an "explanation" showing how the results were obtained. 

Of course, this is a loose specification if we do not clarify what we mean by an explanation (apart 
from whatever provenance information happens to be recorded by the system). There are at least two 
obvious-seeming choices, neither of which seems satisfactory in practice: 

• First, we might record the program that was run along with its input data (and any intermediate 
inputs such as user or network interactions). This, at least, allows us to rerun the program later and 
check that we get the same result, and it also allows us to vary the inputs to see how changes affect 
the output. But this is nearly useless as an explanation, especially for end-users who are not (and 
should not be expected to be) proficient at debugging black-box systems. Moreover, the inputs and 
outputs may be huge (for example, gigabytes of climate data), and it may not be possible for users 
to manually inspect the data. In the longer term, just recording the program is also problematic 
since the computational environment in which the program runs will change — in some sense, this 
is also an "input" that we cannot feasibly record. 
• Second, we might record everything that can be recorded about the computation, in the hope that
it might someday be useful. This also sounds straightforward but is surprisingly difficult to pin 
down, since "everything that can be recorded" can be interpreted in many different ways. Should 
we record every function call? Every instruction? Every molecular interaction? Do we need 
to record what the programmer had for breakfast on the day the program was written? Clearly 
we have to stop somewhere, and for efficiency reasons we should probably stop far short of any 
reasonable definition of "everything". 

Most extant approaches pick some intermediate point between these two extremes, committing to some 
(often not explicitly stated) choices about what is important about the computation that should be recorded 
in its provenance. 

For example, in databases, there are models such as where-provenance [2] (tracking the "sources" of
copied data), lineage [10], why-provenance [2] or how-provenance [14] (tracking tuples "used by" or that
"justify" a result tuple), or dependency provenance [7] (a computable approximation of the information
flow behavior of the program).

Provenance has also been studied extensively in other settings, particularly "scientific workflow" 
systems (e.g. ll2Tl, [20], [18]). Scientific workflows are usually high-level, visual programming languages,
often based on dataflow or Petri-net models of concurrent computation, and often executed on grid or 
cloud computing platforms. This architecture has the advantage that it puts considerable computational 
power into the hands of scientists without forcing them to learn how to program parallel or distributed 
systems at a low level in C++ or Java. However, it also has a serious drawback: distributing a program
over a heterogeneous network dramatically increases the number of things that can go wrong, typically
makes the computation nondeterministic and makes it hard for the user to trust the results.

Scientists are reluctant to publish results based on programs that may contain subtle bugs, and whose 
behavior is different every time they are run, or depends on libraries or other environmental factors 
in subtle ways. Provenance is perceived as important for helping users understand whether results of 
such computations are repeatable and trustworthy, and in particular for scientists to be able to judge the 
scientific validity of results they may wish to publish. 

The work on database provenance is distinctive in that several different formal models have now been 
defined for database query languages with well-understood semantics. This makes it easier to compare, 
relate and generalize these approaches, though such comparisons are only starting to appear [8, 14]. For
most of these models, there are semantic guarantees (or even exact semantic characterizations) relating 
the provenance records to the denotation of the program. On the other hand, for workflow provenance, 
formal definitions of the meaning of workflow programs have only started to appear recently (see for
example [29], [TTl), while the provenance semantics of these tools is usually specified informally, at
best ||2T]|. As a result there is a confusing variety of models and styles of provenance for workflows.

To address this problem, there has been an ongoing community effort, centered on a series of
"Provenance Challenges" [23], to understand and compare the qualitative behavior of these different
systems and synthesize a common format for exchanging provenance among them. This effort has recently
yielded a draft Open Provenance Model, or OPM [22]. Instances of this model are graphs whose nodes
represent agents, processes or artifacts and whose edges represent dependence, generation or control
relationships. The OPM has "semantics" in the sense of the Semantic Web, in that the nodes and edges
are expected to have names that are meaningful to reasonably well-informed users. The OPM standard
draws heavily on informal motivations such as "provenance is the process that led to a result" and "edges
denote causal relationships linking the cause to the effect". But while the OPM specifies a graph notation,
a controlled vocabulary for the edges, and inference rules for inferring new edges from existing edges, it
does not have a "semantics" in the denotational or operational sense by which we might judge whether a
graph is consistent or complete or whether inferences on the graph are valid.

In this paper, we investigate the use of structural causal models [24] as a semantics for these graphs,
and relate the informal motivations invoked in defining OPM graphs with formal definitions of actual
cause and explanation recently proposed by Halpern and Pearl [15, 16]. We do not argue that structural
causal models provide the only or best causal account of provenance. However, structural causal models
are quite close to OPM-style provenance graphs (modulo cosmetic differences), so the analogy is
compelling. Structural models have been studied extensively (e.g. [24, 15, 16, 11, 12, 13]) and have proven
useful to both philosophical accounts of scientific explanation [30] and psychological theories of
understanding fTT\. Nevertheless, it may be enlightening to apply other mathematical theories of causality and
explanation to provenance, or investigate variations and extensions of Halpern and Pearl's approach.

The broader aim of this paper is to argue by example that semantics (in the mathematical sense) is 
badly needed for research on provenance. One of the major motivations for provenance is to improve 
scientific communication, by allowing scientists to generate and exchange computational "explanations" 
or "justifications" of their results. In biology, for example, some journals now require both data and 
workflow programs describing how published results were obtained, and some scientists anticipate that 
scientific publication will evolve into richer documents incorporating text, data, and computation [26].
However, if the techniques used to do so are poorly specified and unverified then we can expect errors 
and confusion. Programming languages and semantics researchers can and should be involved in making 
sure that these techniques are clearly described and robust, to help ensure that scientific communications 
retain long-term value as they gain computational structure. 

2 Examples 

Before delving into technical details, we give a high-level example comparing OPM-style provenance 
graphs with structural causal models. 

The left-hand side of Figure 1 shows a simple OPM graph, based on a standard example showing the
"provenance of a cake" [22]. The right-hand side shows a structural causal model, also depicted as a
graph. These two graphs are intentionally very similar. In the OPM graph, the ovals denote "artifacts"
(flour, water, pan, cake) while the boxes denote "processes" (baking, mixing).

The OPM standard does not specify a semantics of provenance graphs as computations that might
take place in the future; rather, it specifies a syntax for graphs that are supposed to describe processes
that have taken place in the past. Nevertheless, it is natural to read an OPM graph as an expression or
straight-line program that can be rerun to produce the output in the same way as the original process did. 

Conversely, a causal model such as the one in Figure 1 does have an intended computational semantics.
Each node in a causal model is equipped with a function specifying how to compute the value of
the node given values for the parent nodes. An acyclic causal graph can be viewed as a piece of
straight-line code, assigning values to variables as shown along the bottom of Figure 1. Here, we consider a
simplistic Boolean-valued model whose transfer functions are Boolean functions. We can interpret these
definitions as saying that if all of the needed ingredients for each step of the process are present, then
that step will succeed.

Causal models also typically distinguish between the set of exogenous parameters U and endogenous 
variables V . The former are meant to represent unknowns, measurement error, or environmental factors 
not otherwise taken into account in the model. Here, for example, we included one exogenous parameter
$U_i$ for each computation step, representing the possibility that something might go wrong.
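$$
\begin{aligned}
\mathit{Mix} &:= (\mathit{Water} \wedge \mathit{Sugar} \wedge \mathit{Eggs} \wedge \mathit{Flour} \wedge \mathit{Butter}) \oplus U_1 \\
\mathit{Batter} &:= \mathit{Mix} \oplus U_2 \\
\mathit{Bake} &:= (\mathit{Batter} \wedge \mathit{Pan}) \oplus U_3 \\
\mathit{Cake} &:= \mathit{Bake} \oplus U_4
\end{aligned}
$$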

Figure 1: Top left: An OPM graph. Top right: the corresponding causal model. Bottom: a straight-line
code description of both processes. Variables $U_1, U_2, U_3, U_4$ are exogenous variables (error terms, inputs)
abstracting away unknown environmental influences that are not explicitly modeled.
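
The straight-line code at the bottom of Figure 1 can be run directly. The following Python sketch is one possible transcription, under the assumption that the circled operator combining each step with its error term is exclusive-or, so that a true error term flips the outcome of that step. The function name `cake` and its keyword arguments are illustrative only.

```python
def cake(water, sugar, eggs, flour, butter, pan,
         u1=False, u2=False, u3=False, u4=False):
    """Evaluate the Boolean cake model; u1..u4 are the exogenous error terms."""
    mix = (water and sugar and eggs and flour and butter) ^ u1
    batter = mix ^ u2
    bake = (batter and pan) ^ u3
    result = bake ^ u4
    return {"Mix": mix, "Batter": batter, "Bake": bake, "Cake": result}

# All ingredients present and no errors: the cake is produced.
print(cake(True, True, True, True, True, True)["Cake"])           # True
# Same ingredients, but the baking step fails (u3 is true).
print(cake(True, True, True, True, True, True, u3=True)["Cake"])  # False
```

Removing any single ingredient (for example, passing `flour=False`) also yields no cake, since the conjunction in the first step fails and no error term flips it back.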

Causal models are closely related to Bayesian networks, and causal models can also be given a 
probabilistic interpretation. For example, to model a scenario where the oven explodes and destroys the 
cake with probability 0.01, we can employ a probability distribution giving $U_3$ the expected value 0.01.

In artificial intelligence, probabilistic causal models are used either to represent domain knowledge 
that enables a system to reason about cause and effect, or as a representation of hypotheses about some 
data whose behavior is believed to be causal. Bayesian inference can then be used to learn causal models
from data. This kind of inference is appropriate for data generated by controlled experiments where 
a single parameter can be varied to identify cause-effect relationships, but can also be used to analyze 
situations with less ideal experimental designs [24]. 

In this paper, however, we will consider only the deterministic form of causal models and focus 
on their use as a semantic tool, not on inference of causal models. We will show how to interpret a 
provenance graph as a causal model and relate syntactic and semantic techniques for reasoning about 
causality in provenance graphs. 



3 Provenance Graphs 

For the purposes of this paper, we will employ a simplified model of OPM-style provenance graphs,
which we will just call provenance graphs. Fix a set of data values $D$ and a set of process names $P$.
Recall that a bipartite graph $G = (V, W, E)$ is a (here, directed) graph with vertices $V \cup W$ and edges
$E \subseteq (V \times W) \cup (W \times V)$. A provenance graph is a bipartite graph in which the vertices are labeled.
We call the vertices of $V$ the artifacts and the vertices of $W$ the processes; the artifact nodes
are labeled with data values and the process nodes are labeled with process names. Intuitively, an edge
$(a, p) \in E$ means that $a$ was generated by process $p$, also written $a$ wasGeneratedBy $p$, and an edge
$(p, a) \in E$ means that $a$ was used by process $p$, also written $p$ used $a$.

We call a provenance graph functional if for each process node $p \in W$ there is exactly one artifact
$a \in V$ such that $(p, a) \in E$. We also call a provenance graph sorted if there is a function $ar : P \to \mathbb{N}$ such
that each process node $x$ has $ar(x)$ inputs (with in-edges labeled $1, \ldots, n$). Note that a sorted functional
provenance graph is essentially a first-order term with sharing, i.e. a first-order term with nonrecursive
let-binding. However, for the purposes of this paper this sharing is important and it is more convenient
to think in terms of graphs.
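
To illustrate the correspondence with terms, the sketch below represents a sorted functional graph as a list of let-bindings and unfolds it into a plain first-order term. The triple layout `(artifact, process name, inputs)` and the names `shared` and `unfold` are assumptions made for this example, not OPM notation.

```python
def unfold(name, bindings):
    """Expand a let-bound artifact into a first-order term; sharing is duplicated."""
    table = {artifact: (proc, args) for artifact, proc, args in bindings}
    if name not in table:
        return name  # an input artifact, i.e. a free variable of the term
    proc, args = table[name]
    return f"{proc}({', '.join(unfold(a, bindings) for a in args)})"

# let y = inc(x) in let z = add(y, y) in z
shared = [("y", "inc", ["x"]), ("z", "add", ["y", "y"])]
print(unfold("z", shared))  # add(inc(x), inc(x))
```

The graph contains a single node for `y`, whereas the unfolded term mentions `inc(x)` twice; this duplication is the sharing information that the graph view retains.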

In this paper, we consider the simple case where provenance graphs are functional and both
provenance graphs and causal graphs are acyclic. (Note that acyclic causal models are called "recursive" in
the causal models literature.)

Now let us consider the provenance recording scenario described in the introduction: say we have
some program that computes a function $f : D^n \to D$ whose implementation is unknown. We assume a
fixed set of artifact nodes $V$ that includes at least $n$ input nodes $v_1, \ldots, v_n$ and a distinguished result node
$v_0$. A provenance graph semantics for $f$ is a function $P(f) : D^n \to PG(V)$, where $PG(V)$ is the set of all
provenance graphs over $V$.

What properties should a provenance graph semantics have, beyond simply assigning each input a 
graph? We argue that there should be some relationship between the function $f$ being described and
the "meaning" of the provenance graph. But for this to be sensible, we first need to assign the graph a 
computational meaning. 

Given a sorting $ar$, an interpretation $[\![-]\!]$ associates each process name $p$ with a function
$[\![p]\!] : D^{ar(p)} \to D$. (Such an interpretation is essentially an algebra over the first-order signature
$(P, ar)$.) Given a graph $G$ with $n$ input nodes $v_1, \ldots, v_n$, we can therefore define a function
$[\![G]\!] : D^n \to D$ that takes a tuple of values for the input nodes, applies the interpretation functions to
calculate the values for all of the nodes, and finally returns the value of the result node $v_0$.
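
The meaning of a graph can be computed by a single pass over its nodes. The sketch below assumes the graph is given as a topologically ordered list of `(artifact, process name, inputs)` triples and that the interpretation is a dictionary from process names to Python functions; both representations, and the names `evaluate`, `interp` and `graph`, are hypothetical.

```python
def evaluate(graph, interp, inputs, result):
    """Return the value of `result` after propagating `inputs` through `graph`."""
    env = dict(inputs)  # values of the input nodes
    for artifact, proc, args in graph:  # assumed topologically ordered
        env[artifact] = interp[proc](*(env[a] for a in args))
    return env[result]

interp = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
graph = [("s", "add", ["x", "y"]), ("r", "mul", ["s", "x"])]  # r = (x + y) * x
print(evaluate(graph, interp, {"x": 2, "y": 3}, "r"))  # 10
```

Changing `interp` while keeping `graph` fixed gives a different function for the same graph, which is why the interpretation has to be fixed before a graph can be compared with the program it describes.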

Given a provenance semantics $P(f)$ for an unknown function $f$, we say that $P(f)$ is a pointwise
approximation of $f$ if for each input $u$, we have $[\![P(f)(u)]\!](u) = f(u)$. In other words, we can recompute
the value of $f$ at $u$ directly using the provenance graph returned for $u$.

Most graph-based provenance semantics implicitly satisfy this property. But observe that there is
always a trivial pointwise provenance semantics defined by taking $P(f)(u)$ to be a graph where the result
node has no connection to the inputs and is labeled with $f(u)$. This is a pointwise approximation, but
gives us no insight into how the result depends on the inputs. Thus, capturing the original function
pointwise is intuitively necessary, but not sufficient for the provenance graph to be informative. Something
more is needed.

Given a provenance semantics $P()$ for $f$, we say that $P()$ approximates $f$ globally provided that
for each $u$, we have $[\![P(f)(u)]\!] = f$. In other words, a global approximation yields a provenance graph
that simulates $f$ on all inputs. Note that a global approximation exists if (and only if) the language of
provenance graphs is rich enough to express $f$. Specifically, we can define $P(f)$ as the constant function that
returns a provenance graph that simply says "apply $f$". Of course, if $f$ is a large, opaque program then
this may not be helpful.

The global criterion seems very strong: it requires the provenance graph to describe what would 
happen under an arbitrary change to the input, which may be far more information than we need to 
understand just one run of the program. Moreover, a global approximation semantics may not exist for 
reasonable combinations of provenance graph languages and semantics. For example, if the provenance 
graphs can make use of only arithmetic functions then we cannot globally approximate the behavior of 
a program that calculates a properly recursive function such as factorial: there is no fixed arithmetic
circuit that calculates factorials. On the other hand, for certain simple classes of workflows, global
approximation is possible, and clearly desirable. 

Instead of an all-or-nothing correctness property, we now consider the possibility of measuring how 
powerful a provenance semantics is as a way of predicting the function it is meant to represent. 

Let $P()$ be a provenance semantics for $f$. We define a relation $\leadsto$ on input tuples $u$ as follows:

$$u \leadsto u' \iff [\![P(f)(u)]\!](u') = f(u')$$

In other words, $u \leadsto u'$ holds just in case the provenance graph for $f$ on $u$ correctly predicts the value of
$f$ on $u'$. That is, the relation $\leadsto$ quantifies the generality of the provenance graphs produced for $f$.
We call the relation $\leadsto$ the predictive power of provenance semantics $P$.

Observe that $\leadsto$ is reflexive if and only if $P()$ is a pointwise approximation, and it is total if and only
if $P()$ is a global approximation. Furthermore, given two provenance semantics $P_1()$ and $P_2()$, we can
compare them by comparing their predictive power $\leadsto_1$ and $\leadsto_2$. Specifically, if
${\leadsto_1} \subseteq {\leadsto_2}$ then $P_2()$ has at least as much predictive power as $P_1()$, which we could
also write as $P_1() \leq P_2()$. Note that two
provenance semantics might have equal predictive power but be very different as functions.
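
On a finite set of inputs the predictive-power relation can simply be tabulated. The sketch below does this for the absolute-value function and two hypothetical provenance semantics. As a simplifying assumption, each provenance graph is represented directly by its meaning as a Python function, and the names `trace_semantics`, `trivial_semantics` and `predictive_power` are invented for the example.

```python
def f(x):
    return abs(x)

def trace_semantics(u):
    """Hypothetical P(f)(u): the branch of f that was taken on input u."""
    return (lambda x: x) if u >= 0 else (lambda x: -x)

def trivial_semantics(u):
    """Trivial pointwise semantics: a graph whose result is the constant f(u)."""
    value = f(u)
    return lambda x: value

def predictive_power(semantics, domain):
    """Pairs (u, u2) for which the graph recorded for u predicts f correctly on u2."""
    return {(u, u2) for u in domain for u2 in domain if semantics(u)(u2) == f(u2)}

domain = range(-2, 3)
trace = predictive_power(trace_semantics, domain)
trivial = predictive_power(trivial_semantics, domain)
reflexive = all((u, u) in trace and (u, u) in trivial for u in domain)
total = len(trace) == len(domain) ** 2 or len(trivial) == len(domain) ** 2
print(reflexive, total)  # True False
```

Both relations are reflexive and neither is total, so both semantics are pointwise but not global approximations. In this example they are also incomparable: the pair (1, 2) belongs only to `trace`, while (1, -1) belongs only to `trivial`.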

Given a language of provenance graphs and a fixed interpretation of the process names they may
use, consider the problem of designing a provenance semantics for a given function $f$ whose predictive
power is as great as possible. Some interesting questions we will leave for future work include: Is there
a unique "most powerful" semantics? Given the program text of $f$, is there an effective way to produce
the most powerful (or a reasonably good) semantics?



4 Causal models 

In the discussion so far, the function $f$ being investigated is essentially a black box: we can provide
inputs and observe the output, and nothing else. There may be intermediate states and values calculated
by $f$ that are not visible externally, nor can we gain understanding of $f$ by artificially altering intermediate
states. Understanding can often be aided by considering hypothetical changes in the middle of a process.
For example, suppose $f$ is defined as follows:

f{x,y,z) - 



Then $f(1, 0, z)$ is undefined (division by zero), but there is no way to change just one input variable to
a value such that $f$ is defined. Of course, inspecting the definition of the function makes it obvious that
both $x$ and $y$ need to change, but if $f$ is a black box and $x$ and $y$ are just two of many input parameters then
finding a way to change the inputs to fix the problem would require many blind guesses. If we could see
inside of $f$ and possibly change some intermediate error results to dummy valid results, then we might
be able to find the problem more easily.

Structural causal models are (from a certain point of view) models of computation that permit some 
inspection and intervention upon the intermediate results. We will now review the definitions of causal 
models and actual causes from [15, 16]. Due to space limits, we unfortunately cannot provide
adequate explanation and motivation for these definitions, but Pearl [24] provides a great deal of additional
background and motivation for causal models.

A causal model $M = (U, V, F)$ is a structure where $U$ is a set of inputs (or exogenous variables), $V$ is a
set of (endogenous) variables, and $F$ is a family of functions $F_X$, indexed by $V$, mapping each vertex $X$ in
$V$ to a function from tuples of data values to data values. A causal graph $G = (U \cup V, E)$ for
$M$ is a directed graph whose vertices are inputs or events of $M$ and such that $F_X$ depends only on the values
corresponding to parent vertices of $X$ in $G$. We restrict attention to causal models whose underlying graph
is acyclic (these are sometimes called "recursive" models in the structural-models literature). For each
acyclic causal model there is a unique least causal graph (up to isomorphism) expressing all and only the
true dependencies of the functions $F_X$.

A valuation is simply a $V$-indexed tuple of values in $D$, or $\sigma : V \to D$. We say a valuation is consistent
with $M$ if for each $X$ we have $F_X(\sigma) = \sigma(X)$. A causal model paired up with a (consistent) valuation
is called a (consistent) causal situation.

Interventions are a distinctive feature of causal models. Given a model M, we can form another 
model $M[X := x]$ by fixing the value of vertex $X$ to $x$ and making the value of $X$ independent of its former
inputs. This reflects the fact that although the causal model represents the behavior of a closed system, 
we can reach into the system (at least as a thought experiment) and change X to a value of our choice. 

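Definition 1. If $M = (U, V, F)$ is a causal model, then the result of setting $X$ to $x$ in $M$ is
$M[X := x] = (U, V, F')$, where:

$$F'_Y(\sigma) = \begin{cases} x & \text{if } Y = X \\ F_Y(\sigma) & \text{if } Y \neq X \end{cases}$$

If the appropriate variable $X$ is clear from context, we may write just $M_x$ instead of $M[X := x]$; similarly, for
a sequence of interventions $\vec{X} := \vec{x}$ we may just write $M_{\vec{x}}$ for $M[\vec{X} := \vec{x}]$.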
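
Definition 1 amounts to replacing one equation by a constant. The following is a minimal Python sketch, assuming a model is a dictionary from endogenous variables to functions of the current valuation and that `order` lists those variables topologically; the names `solve` and `intervene` are illustrative.

```python
def solve(model, inputs, order):
    """Compute the values of all variables of an acyclic model, in topological order."""
    sigma = dict(inputs)
    for x in order:
        sigma[x] = model[x](sigma)
    return sigma

def intervene(model, x, value):
    """Return M[X := x]: X becomes a constant, every other equation is unchanged."""
    modified = dict(model)
    modified[x] = lambda sigma: value
    return modified

model = {
    "Mix": lambda s: s["Flour"] and s["Water"],
    "Cake": lambda s: s["Mix"] and s["Pan"],
}
inputs = {"Flour": True, "Water": False, "Pan": True}
order = ["Mix", "Cake"]
print(solve(model, inputs, order)["Cake"])                          # False
print(solve(intervene(model, "Mix", True), inputs, order)["Cake"])  # True
```

The intervention does not touch the inputs: water is still missing, but the mixing step is forced to succeed, and the downstream equation for the cake reacts to that forced value.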

We also review Halpern and Pearl's definition of "actual cause". 

Definition 2 (Actual cause). Let $(M, \sigma)$ be a causal situation. Let $X$ be a subset of $V$ and $Y \in V$, and
suppose $x = \sigma(X)$ and $y = \sigma(Y)$. Suppose that:

1. $\sigma(X) = x$ and $\sigma(Y) = y$.

2. Some set of variables $W \subseteq V - X$ and values $x'$ and $w'$ in $D$ exist such that:

(a) $Y \neq y$ holds in $M_{x', w'}$.

(b) $Y = y$ holds in $M_{x, w', z}$ for all $Z \subseteq V - (X \cup W)$, where $z$ are the values of $Z$ in $M$.

Then we say that $X = x$ is a weak cause of $Y = y$. Moreover, if no proper subset of $X = x$ is a weak cause,
then $X = x$ is an actual cause of $Y = y$.

The definition of actual cause deserves explanation. Briefly, the idea is that an actual cause must 
first describe the true state of affairs (part 1). Secondly, there must be a way to change the value of Y 
by changing the values of X (possibly plus some other variables W), and there must not be a way of 
changing the value of Y by fixing the values of X = x, W = w' and setting any other variables to their 
original values. (In fact, the variables Z are usually between X and Y.) 

For additional discussion we must refer the reader to Halpern and Pearl [15]. The discussion of actual 
causes later in this paper does not depend heavily on the details of this definition. 
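
For small models, Definition 2 can be checked by exhaustive search, which also makes its quantifier structure explicit. The sketch below is a brute-force reading of the definition for finite domains; it takes time exponential in the number of variables and is not an algorithm taken from Halpern and Pearl. The model representation and all function names are assumptions of this example.

```python
from itertools import chain, combinations, product

def subsets(items):
    """All subsets of `items`, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def solve(model, order, inputs, fixed):
    """Values of all variables when the variables in `fixed` are forced."""
    sigma = dict(inputs)
    for v in order:
        sigma[v] = fixed[v] if v in fixed else model[v](sigma)
    return sigma

def is_weak_cause(model, order, inputs, xs, y, domain=(False, True)):
    """Is xs = sigma(xs) a weak cause of y = sigma(y) in the actual situation?"""
    sigma = solve(model, order, inputs, {})
    others = [v for v in order if v not in xs]
    for ws in subsets(others):
        rest = [v for v in others if v not in ws]
        for vals in product(domain, repeat=len(xs) + len(ws)):
            changed = dict(zip(list(xs) + list(ws), vals))
            # (a) forcing xs and ws to the new values must change y
            if solve(model, order, inputs, changed)[y] == sigma[y]:
                continue
            # (b) with xs restored and ws kept, y must return to its actual value
            #     no matter which remaining variables are frozen at actual values
            base = {**{w: changed[w] for w in ws}, **{x: sigma[x] for x in xs}}
            if all(solve(model, order, inputs, {**base, **{z: sigma[z] for z in zs}})[y]
                   == sigma[y] for zs in subsets(rest)):
                return True
    return False

def is_actual_cause(model, order, inputs, xs, y):
    """A weak cause none of whose proper subsets is a weak cause."""
    return is_weak_cause(model, order, inputs, xs, y) and not any(
        is_weak_cause(model, order, inputs, sub, y)
        for sub in subsets(xs) if len(sub) < len(xs))

# Overdetermination: the fire F starts if lightning L or a match M occurs;
# N is an unrelated variable.
model = {"L": lambda s: s["uL"], "M": lambda s: s["uM"], "N": lambda s: s["uN"],
         "F": lambda s: s["L"] or s["M"]}
order = ["L", "M", "N", "F"]
inputs = {"uL": True, "uM": True, "uN": True}
print(is_actual_cause(model, order, inputs, ("L",), "F"))  # True
print(is_weak_cause(model, order, inputs, ("N",), "F"))    # False
```

In the example, the lightning alone counts as an actual cause of the fire even though the match would have sufficed: the witness is the contingency `W = {M}` with the match switched off. The pair `("L", "M")` is a weak cause but not an actual cause, because its proper subset `("L",)` is already a weak cause.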

5 Causal interpretations of provenance graphs 

Given a fixed interpretation of the process identifiers of a provenance graph $G = (V, W, E)$, we can define
a causal model and a valuation $(M_G, \sigma_G)$. Suppose $V = U' \cup V'$ where $U'$ is the set of input nodes in $G$
(that is, artifact nodes not generated by any process). Let $M_G = (U', V' \cup W, F)$, where we define $F_{v'}(\sigma)$
for $v' \in V'$ as the (unique) value of $\sigma(p)$ where $p$ is the process node that generates $v'$, and $F_p(\sigma)$ is the
interpretation function for node $p$ in $G$, lifted to apply to $\sigma$ by ignoring all values of nodes not used by $p$
in $G$. Moreover, we define the valuation $\sigma_G$ for $M_G$ by taking $\sigma_G(v)$ to be the data value assigned to $v$ in
$G$.
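
The construction of the causal model from a graph can be sketched as follows. The encoding of a functional provenance graph as a topologically ordered list of `(artifact, process node, process name, used artifacts)` tuples is an assumption of this example, as are the names `causal_model_of` and `solve`.

```python
def causal_model_of(graph, interp):
    """Build the equations of the causal model and a topological order of its variables."""
    model, order = {}, []
    for artifact, p, name, used in graph:
        # The equation for p applies the interpretation of p's name to the artifacts p used.
        model[p] = lambda s, fn=interp[name], used=tuple(used): fn(*(s[a] for a in used))
        # The equation for a generated artifact copies the value of its generating process.
        model[artifact] = lambda s, p=p: s[p]
        order += [p, artifact]
    return model, order

def solve(model, order, inputs, fixed=None):
    """Evaluate the equations in order, forcing any variables listed in `fixed`."""
    fixed = fixed or {}
    sigma = dict(inputs)
    for v in order:
        sigma[v] = fixed[v] if v in fixed else model[v](sigma)
    return sigma

graph = [("batter", "p1", "mix", ["flour", "water"]),
         ("cake", "p2", "bake", ["batter", "pan"])]
interp = {"mix": lambda a, b: a and b, "bake": lambda a, b: a and b}
model, order = causal_model_of(graph, interp)
inputs = {"flour": True, "water": True, "pan": True}
print(solve(model, order, inputs)["cake"])                     # True
print(solve(model, order, inputs, {"batter": False})["cake"])  # False
```

Process nodes and generated artifacts both become endogenous variables, so an intervention can target either a step (`p1`) or its product (`batter`).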

Just as the functional interpretation of a provenance graph can be viewed as a local approximation of 
some unknown (functional) process, its causal model interpretation can be viewed as an approximation 
of some unknown (causal) process. To make this idea precise, we introduce an abstract notion of "causal 
functions" that support both ordinary evaluation and intervention. 

A causal function over variables $(U, V)$ is a family of functions $f_\tau : D^U \to D^V$, one for each partial
valuation $\tau : V \rightharpoonup D$. We write $f$ for $f_\emptyset$, and we require that for each partial valuation $\tau$ and input
valuation $\sigma$ we have $\tau \subseteq f_\tau(\sigma)$. Thus, for example, $f(\sigma)$ computes all of the values of variables directly
from the inputs without any interference, while $f_{[X:=x, Y:=y]}(\sigma)$ calculates the values when $X$ is forced to
be $x$ and $Y$ is forced to be $y$. Note that a causal model $M$ defines a unique causal function, which we will
write $[\![M]\!]$. On the other hand, the definition of causal function makes sense without talking about the
specific propagation mechanism used to calculate the values.

Suppose that $f$ is a causal function with inputs $U$ and variables $V$, and let $P(f)$ be a function mapping
input values $u \in D^U$ to causal models $P(f)(u)$ over inputs $U$ and variables $V$. Then $[\![P(f)(u)]\!]$ is again a
causal function for each $u$.

We say that a provenance semantics $P()$ approximates $f$ pointwise if for any $u$, we have
$[\![P(f)(u)]\!](u) = f(u)$. That is, both the end results and intermediate states of $P(f)$ agree with $f$. As with
pointwise approximations in the functional case, this is a rather weak specification, since it is satisfied by
defining $P(f)(u)$ as a disconnected causal process that simply defines each variable $X \in V$ to be the constant
$f(u)(X)$.

Likewise, we can define $P(f)$ to be a global approximation of $f$ when $[\![P(f)(u)]\!] = f$ for any $u$.
Again, this is very strong, since it says that $f$ must be expressible as a single causal model; that is, the
provenance graph for $f$ tells us everything about the causal structure of $f$.

Causal functions admit a third natural notion of approximation, which is strictly in between pointwise
and global approximation. We say that provenance semantics $P(f)$ approximates $f$ locally if for any $u$,
we have

$$[\![P(f)(u)]\!]_\tau(u) = f_\tau(u) \quad \text{for every partial valuation } \tau.$$

That is, approximating $f$ locally does not require showing how $f$ would behave on arbitrary inputs,
but does ensure that we can predict how the particular run of $f(u)$ would change if we had intervened by
fixing the values of the variables in $V$.

Local approximation is strictly stronger than pointwise approximation: clearly local implies
pointwise, but for a simple process such as $Y := X + 1;\ Z := Y * 2$ the trivial pointwise approximation is not a
local approximation. The requirement to handle arbitrary interventions forces us to do more than simply 
record the actual values of the intermediate variables; instead, we must record at least some information 
about how they were related. 
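
The difference shows up as soon as we intervene. The sketch below assumes the example process first increments its input and then doubles the result (that is, Y := X + 1 followed by Z := Y * 2), and compares it with the trivial approximation that records only constants. The helper `run` and the list-of-equations encoding are hypothetical.

```python
def run(equations, x, tau=None):
    """Evaluate the equations in order on input x, forcing the variables listed in tau."""
    tau = tau or {}
    sigma = {"X": x}
    for var, fn in equations:
        sigma[var] = tau[var] if var in tau else fn(sigma)
    return {var: sigma[var] for var, _ in equations}

true_process = [("Y", lambda s: s["X"] + 1), ("Z", lambda s: s["Y"] * 2)]

def trivial_model(x):
    """Trivial pointwise approximation: every variable is a constant recorded from the run."""
    observed = run(true_process, x)
    return [(var, lambda s, c=c: c) for var, c in observed.items()]

print(run(true_process, 3), run(trivial_model(3), 3))
# {'Y': 4, 'Z': 8} {'Y': 4, 'Z': 8}
print(run(true_process, 3, {"Y": 10}), run(trivial_model(3), 3, {"Y": 10}))
# {'Y': 10, 'Z': 20} {'Y': 10, 'Z': 8}
```

Without intervention the two agree, so the trivial model is a pointwise approximation. Forcing `Y` to 10 separates them: the real process propagates the change to `Z`, while the trivial model still reports the recorded value.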

On the other hand, global approximation obviously implies local approximation, but not the converse. 
Local approximation is strictly weaker than global approximation, since there is no requirement that the 
causal model obtained for a run on input $u$ will be useful for predicting results or causal structure of $f$
running on another input. This allows different causal models to be used for different inputs, as long as
the causal structure of the model chosen for $f$ only depends on the input parameters, not intermediate
variable values.

For example, consider $f(u; x, y) = (x + y)^u$, where the provenance graph may only employ
multiplication and addition. Then there is no single provenance graph (over basic arithmetic functions) that
approximates $f$ globally, but for each choice of $u$ there is a graph that describes the causal function

$$(x, y) \mapsto f(u; x, y).$$
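
For instance, the graph for a given exponent can be produced by unrolling the exponentiation. The following is a minimal sketch, assuming the exponent is at least 1 and using a hypothetical encoding of graphs as lists of `(artifact, process name, inputs)` triples; `power_graph` and `evaluate` are names invented for the example.

```python
def power_graph(u):
    """Graph computing (x + y)**u with one addition and u - 1 multiplications (u >= 1)."""
    graph, result = [("s", "add", ["x", "y"])], "s"
    for i in range(2, u + 1):
        graph.append((f"p{i}", "mul", [result, "s"]))
        result = f"p{i}"
    return graph, result

def evaluate(graph, result, x, y):
    """Evaluate the graph on inputs x and y using ordinary integer arithmetic."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    env = {"x": x, "y": y}
    for artifact, proc, args in graph:
        env[artifact] = ops[proc](*(env[a] for a in args))
    return env[result]

graph3, result3 = power_graph(3)
print(evaluate(graph3, result3, 1, 2))  # 27
```

The shape of the graph depends on the exponent (it has as many process nodes as the exponent), so no single graph serves all exponents, yet each individual graph is correct for every choice of the two remaining inputs.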

As with the functional case, we might also expect a causal model to have partial explanatory power
for other input settings. Specifically, we define the predictive power of $P(f)$ as the relation:

$$u \leadsto u' \iff [\![P(f)(u)]\!]_\tau(u') = f_\tau(u') \text{ for every partial valuation } \tau$$

In other words, $u \leadsto u'$ holds when the causal model generated from the provenance of $u$ is still a faithful
model of the behavior of $f$ at $u'$ under arbitrary interventions.

For this definition of predictive power, local and global approximation correspond to reflexivity and
totality of $\leadsto$. As with the functional case, we can compare provenance semantics by their predictive
power.

6 Inferences about causality in provenance 

The OPM standard also describes inferences on provenance graphs. These inferences are formalized as
Datalog-style inference rules that allow us to infer new edges from existing edges. For our provenance
graphs, the relevant rules from the OPM standard include:

$$
\begin{aligned}
x\ \mathsf{wasDerivedFrom}\ y &\mathrel{:-} x\ \mathsf{wasGeneratedBy}\ p \wedge p\ \mathsf{used}\ y \\
p\ \mathsf{wasTriggeredBy}\ q &\mathrel{:-} p\ \mathsf{used}\ x \wedge x\ \mathsf{wasGeneratedBy}\ q \\
x\ \mathsf{wasDerivedFrom}^{+}\ y &\mathrel{:-} x\ \mathsf{wasDerivedFrom}\ y \vee (x\ \mathsf{wasDerivedFrom}\ z \wedge z\ \mathsf{wasDerivedFrom}^{+}\ y) \\
p\ \mathsf{wasTriggeredBy}^{+}\ q &\mathrel{:-} p\ \mathsf{wasTriggeredBy}\ q \vee (p\ \mathsf{wasTriggeredBy}\ r \wedge r\ \mathsf{wasTriggeredBy}^{+}\ q)
\end{aligned}
$$

For example, if process $p$ "used" artifact $x$ which "was generated by" process $q$, then we infer that $p$
"was triggered by" $q$. However it is important to note that the terms "used", "was generated by" and
"influenced" are simply different edge labels whose relationship is being axiomatized by the Datalog
rules; there is no connection in the standard to the causal-model interpretation of provenance graphs
introduced in this paper.
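
These rules can be evaluated by a simple fixpoint computation. The sketch below is a direct Python reading of the four rules over sets of edge pairs, with `used` containing pairs (process, artifact) and `generated_by` containing pairs (artifact, process). It is an illustration only, not code from the OPM standard, and the node names are taken from the cake example.

```python
def closure(rel):
    """Transitive closure of a binary relation by naive fixpoint iteration."""
    result = set(rel)
    while True:
        new = {(a, d) for (a, b) in rel for (c, d) in result if b == c} - result
        if not new:
            return result
        result |= new

def infer(used, generated_by):
    """Return wasDerivedFrom, wasTriggeredBy and their transitive closures."""
    derived = {(x, y) for (x, p) in generated_by for (p2, y) in used if p == p2}
    triggered = {(p, q) for (p, x) in used for (x2, q) in generated_by if x == x2}
    return derived, triggered, closure(derived), closure(triggered)

used = {("mix", "flour"), ("mix", "water"), ("bake", "batter"), ("bake", "pan")}
generated_by = {("batter", "mix"), ("cake", "bake")}
derived, triggered, derived_plus, triggered_plus = infer(used, generated_by)
print(sorted(derived_plus - derived))  # [('cake', 'flour'), ('cake', 'water')]
```

The computation follows edges purely syntactically; it never consults the values at the nodes or the functions attached to the processes.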

We want to give these edge relations meaning in terms of the causal model interpretation, and use this 
as a basis for judging the correctness of inferences. In particular, we would like the edges to correspond
to "actual causes", as they are expected to do in OPM. 

Consider a naive interpretation in which we interpret $p$ used $x$ as meaning that $x = \sigma(x)$ was part of an
actual cause of $p = \sigma(p)$ in $M_G$, and likewise interpret $x$ wasGeneratedBy $p$ as meaning that $p = \sigma(p)$
was part of an actual cause of $x = \sigma(x)$ in $M_G$. However, this is a little too naive, since it is possible
for $X = x$ to actually cause $Y = y$ and $Y = y$ to actually cause $Z = z$, while in OPM graphs, the used
and wasGeneratedBy edges should not have this behavior. We can fix this by further constraining these
relations to hold only when there is no other actual cause "between" $p$ and $x$. Using these definitions
of the basic edges, we can safely extrapolate meanings for the other edges using the Datalog rules: for
example, $x$ wasDerivedFrom $y$ means that $Y = y$ is part of an actual cause of $X = x$ and there is no other
artifact strictly between $x$ and $y$, and wasDerivedFrom$^{+}$ is the transitive closure, which (for a finite causal
model) is equivalent to saying that $Y = y$ is part of an actual cause of $X = x$ without any constraints. (We
leave these observations as conjectures for now.) 

However, there is an important problem with the above idea: namely, we have defined the meanings 
of the edges in terms of the definition of actual cause, not in terms of the presence or absence of edges in 
M. Although every actual cause relationship corresponds to an edge, not every edge syntactically present 
in Mq necessarily corresponds to an immediate actual cause. In a specific situation, some possible 
flows of information may not actually be causally relevant. In other words, the Datalog rules over the 
provenance graph are complete, but unsound, with respect to actual causality; the naive approach of 
following all syntactic links to find all nodes reachable from a given node is a safe over-approximation 
(but it may be very inaccurate). 
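
The gap between syntactic edges and causal relevance can be illustrated with a small sketch. The process below is hypothetical, and the check used is a simple "but-for" counterfactual test (change one input, hold the others fixed), which is deliberately weaker than the Halpern-Pearl definition of actual cause; it is meant only to show that an input recorded as used need not matter in a particular run.

```python
# Hypothetical process with three inputs: the output is a when sel is true, else b.
def process(sel, a, b):
    return a if sel else b


# The particular run being explained, and the (assumed) finite domain of each input.
inputs = {"sel": True, "a": 1, "b": 0}
domains = {"sel": [True, False], "a": [0, 1], "b": [0, 1]}

# A provenance graph would record a "used" edge for every input of the process.
syntactic_parents = set(inputs)


def but_for_relevant(name):
    """True if changing `name` alone to some other value changes the output."""
    actual = process(**inputs)
    return any(process(**{**inputs, name: value}) != actual
               for value in domains[name])


relevant = {name for name in inputs if but_for_relevant(name)}
```

In this run, `relevant` contains `sel` and `a` but not `b`, while `syntactic_parents` contains all three, so following every recorded edge over-approximates what mattered.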

To clarify this point, we review some of the complexity results of Eiter and Lukasiewicz [11]. For 
deterministic causal models, they study the complexity of several definitions of causality, including the 
"actual cause" definition introduced by Halpern and Pearl [15]. They show that determining whether 
a set of variables and values is an actual cause of another event is NP-hard, even for Boolean causal 
models. These complexity lower bounds hold whether or not the function definitions are considered part 
of the problem, since in the Boolean case the amount of information needed to describe the operators is 
small. For a fixed set of rules, Datalog queries can be evaluated in polynomial time [19], so (unless 
P = NP) they cannot in general correctly calculate actual causes. 

7 Related and future work 

As discussed in the introduction, we draw on previous work on causality based on structural models in 
artificial intelligence (see for example [24] for a comprehensive discussion of this work and its 
applications). The worst-case complexity and tractable special cases of computing actual causes and causal 
explanations are comprehensively studied in a series of papers by Eiter and Lukasiewicz [11, 12, 13]. As 
we have noted, these results immediately imply that simple transitive closure algorithms cannot 
suffice for extracting actual causes from provenance, at least if Halpern 
and Pearl's definitions are used. 

Halpern gave a well-received invited presentation on causality, responsibility and blame at a recent 
workshop on provenance (5). The analogy between causal models and provenance has been "in the air" 
since then, and has been discussed in more recent work by Chapman et al. [4] and an overview paper by 
Cheney et al. H 

The treatment of causality here plays a similar role to dependence in previous work with Acar and 
Ahmed on dependency provenance fj\, and subsequently on a "complete provenance trace" model lH. 
These models support some inference concerning how hypothetical changes to (parts of) the input affect 
(parts of) the output. However, these techniques are based on a database query language supporting 
arbitrarily nested sets and records, which do not seem to match structural causal models over atomic 
data. Acar et al. [1] present a translation from a core database query language to OPM-style provenance 
graphs; the graphs in that work do not have a causal or computational interpretation, but exploring this 
possibility is an obvious next step. 

In this paper, we have focused on a (perhaps deceptively) simple setting where the provenance graphs 
are directly analogous to causal models. It is not clear how far this analogy can be pushed using known 
forms of causal models; it is possible that extensions to the causal framework are needed to provide 
semantics for provenance models describing richer computational models. For example, matters likely 
become more interesting in the presence of concurrent processes that may interact with one another (see 
for example [28]). This is a common case in scientific workflow provenance, and causal provenance 
semantics should be developed for (or adapted to) models such as dataflow networks or process. 

There are several other developments in structural causal models that would be interesting to 
explore from the point of view of provenance, including using probabilistic causal models to explain how 
provenance can make unreliable computations statistically "repeatable", and using existing definitions of 
explanation, responsibility and blame for provenance. 

Finally, as illustrated by the "cake" example, provenance records sometimes employ an implicit 
notion of state and resource consumption. For example, in baking a cake, some inputs are "consumed" 
(e.g. the ingredients), while others are only catalysts (e.g. the cake pan). The provenance model also 
does not reflect time or resource usage. Of course, this kind of information can be added to the data 
model (and the OPM draft already allows for both time and domain-specific annotations), but it would 
be interesting to explore applications of linear or resource-sensitive logics to this problem. 

8 Conclusions 

We have explored an analogy between the OPM-style provenance graphs that are becoming popular 
in scientific workflow systems (and other settings) and structural causal models as developed by Pearl, 
Halpern and others in artificial intelligence. In particular, a provenance graph can be interpreted as a 
causal model in a natural way, provided that we know how to interpret the processing steps as 
functions. We considered the problem of using provenance graphs to represent particular runs of a program. 
Existing definitions and results concerning causal models can be lifted to provenance graphs. 

We also investigated the problem of inferring actual causes over provenance graphs using 
Datalog-style rules. As shown by Eiter and Lukasiewicz [12], the complexity of computing 
explanations in the Halpern-Pearl approach is quite high (usually at least NP-hard), which shows that, for 
example, queries about actual cause or explanation cannot be expressed using Datalog rules (unless 
P = NP), even in the special case of provenance graphs over Boolean functions. This shows that Halpern 
and Pearl's definitions of actual cause and the ad hoc rules that have been proposed for reasoning about 
provenance graphs are incompatible. 

These observations are not deep — they amount to applying known results in the structural causality 
literature to analyze informal claims in the provenance literature. Adopting a different model of causality 
might lead to different conclusions. Nevertheless, we believe that formalizing the relationship 
between provenance and causal explanation has not been seriously attempted before and is 
worthwhile, if for no other reason than to provoke others to do a better job. 